\[\\[0in]\]

Abstract

This research project showcases an overview of data analysis methods associated with the single-cell RNA sequencing pipeline, with cell clustering maps as the final product. Using Bioconductor and R, this project is based on raw data from a paper focused on identifying respiratory immune characteristics associated with severe acute respiratory syndrome (SARS-CoV-2), through scRNA-Seq analysis of immune cells from bronchoalveolar lavage fluid. This project begins with a walkthrough of preprocessing steps including data unpacking, metadata alignment, quality control, and normalization to help reduce skew in data. This follows with dimensionality reduction plots in three forms: Principal Component Analysis (PCA), T-Distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). Using these dimensionality reduction plots, two forms of clustering are demonstrated: graph-based and k-means clustering. Based on these plots, a significant difference can be noted when comparing macrophage expression profiles from COVID-19 patients and the control group: a distinct, large group of macrophages associated with severe COVID-19 patients is apparent. This observation supports the finding in the original paper (Liao et. al.) which states that proinflammatory monocyte-derived macrophages were abundant in lavage fluid from patients with severe COVID-19.

\[\\[0in]\]

Overview

In this research project, I aim to provide an overview of data analysis associated with scRNA-Seq (single-cell RNA sequencing), by creating cell clustering maps.

scRNA-Seq

scRNA-Seq measures the activity of thousands of genes within single cells. It practically accomplishes this by measuring the transcriptome (the sum of mRNA or messenger RNA transcripts) of many single cells, usually from one tissue.

The main goal of scRNA-Seq is to “characterize heterogeneity across cells” by measuring gene expression differences between single cells.

By allowing us to cell-specific expression differences, scRNA-Seq gives us detailed information that is useful for a number of applications. For example, we can identify gene expression patterns between individuals who have and do not have a certain disease.

The Dataset

This project uses raw data from the following paper, titled “Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19.”

This paper was written by Mingfeng Liao, Yang Liu, Jing Yuan, and several others from the Institute of Hepatology in the National Clinical Research Center for Infectious Disease at Shenzhen Third People’s Hospital. It was published on May 12, 2020.

The focus of this paper was identifying respiratory immune characteristics associated with severe acute respiratory syndrome (SARS)-CoV2, through scRNA-Seq analysis of immune cells from broncheoalveolar lavage fluid. This fluid was collected from patients with varying severity of COVID-19 as well as healthy controls.

Bronchoalveolar lavage fluid (BALF) is fluid collected from a lung segment, through a diagnostic technique called bronchoscopy. In bronchoscopy, fluid is squirted into a small part of the lung and is promptly recollected for analysis. After it is collected, BALF mainly contains alveoli secretions (where gas exchange takes place). A typical application for BALF is diagnosing lung disease.

As COVID-19 is a respiratory disease, performing scRNA-Seq on cells from BALF can yield insights into how this disease affects gene expression of our cells. This paper explicitly states that it examines immune cells. Immune cells have multiple subtypes including macrophages, T cells, and B cells – it is important to distinguish between these cell types, as each type peforms a different function.

Note: This paper has raw data from 4 controls and 9 people with COVID-19. For the purposes of this project, I decided to only analyze 3 controls because performing cell alignment on the fourth control (with data from a separate source) caused Excel to crash multiple times.

\[\\[0in]\]

Following Along

If you want to follow along, go to the GitHub repository for this project. Make sure to download the repository and unzip it - it contains a folder called “scRNA-Seq-with-COVID-19-Data-master”.

After that, do the following:

  1. Download and install R.
  2. Download and install RStudio.
  3. Download and install 7Zip as it will easily allow to handle .tar.gz files. If you are on macOS, Keka is a great alternative.

Once you have installed R and RStudio, go to the GitHub respository download folder, “scRNA-Seq-with-COVID-19-Data-master”.

There, open the R Project file “scrna-seq-covid-project” in RStudio. Create a new R Script/R Markdown document (name it “my-r-file”) and save it in “scRNA-Seq-with-COVID-19-Data-master”. This is roughly what your main folder should look like:

In R (specifically, in “my-r-file”), install Bioconductor and devtools.

Packages

Make sure to install all of the following packages before starting Step 1 (you can do this by typing install.packages("<package name>") and BiocManager::install("<package name>")into the R console for CRAN packages and Bioconductor packages, respectively):

  • Seurat (through CRAN)
  • SingleCellExperiment (through Bioconductor)
  • scater (through Bioconductor)
  • scran (through Bioconductor)
  • cluster (through CRAN)
  • PCAtools (through Bioconductor)
  • hdf5r (through CRAN) – if prompted, make sure to say “no” to installing from sources

Make sure to say “yes” to updates and additional software that is being downloaded along with these packages.

Call these packages:

library(Seurat)
library(SingleCellExperiment)
library(scater)
library(scran)
library(cluster)
library(PCAtools)
library(hdf5r)

\[\\[0in]\]

Step 1: Preprocessing

Data Loading

The data we will be working with is an intermediate product of scRNA-Seq – it has already through the initial data processing stages (genome alignment). However, what is most important to know is that our data will contain genes as the rows and cells (from various samples) as the columns.

Our data is listed in NCBI’s Gene Expression Omnibus (GEO) under the accession code GSE145926.

Once you have headed to the site, download the supplementary file (at the bottom of the page), “GSE145926_RAW.tar”.

Extract the .tar file and go to the folder with its contents. There, you will find many .h5 and .gz files.

Separate the 12 .h5 files from the 9 .gz files. Put the .h5 files in a separate folder (you can name this folder “count-matrices”): we do not need the .gz files for this project. Place the folder with the .h5 files in “scRNA-Seq-with-COVID-19-Data-master”. This is what your “count-matrices” folder should look like:

Go back to “my-r-file” (in RStudio) and set your working directory to “count-matrices”. You can do this by clicking “Set Working Directory” under the Session tab in RStudio.

Once that is done, enter the following commands:

c51 <- Read10X_h5("GSM4475048_C51_filtered_feature_bc_matrix.h5")
c52 <- Read10X_h5("GSM4475049_C52_filtered_feature_bc_matrix.h5")
c100 <- Read10X_h5("GSM4475050_C100_filtered_feature_bc_matrix.h5")
c141 <- Read10X_h5("GSM4339769_C141_filtered_feature_bc_matrix.h5")
c142 <- Read10X_h5("GSM4339770_C142_filtered_feature_bc_matrix.h5")
c143 <- Read10X_h5("GSM4339771_C143_filtered_feature_bc_matrix.h5")
c144 <- Read10X_h5("GSM4339772_C144_filtered_feature_bc_matrix.h5")
c145 <- Read10X_h5("GSM4339773_C145_filtered_feature_bc_matrix.h5")
c146 <- Read10X_h5("GSM4339774_C146_filtered_feature_bc_matrix.h5")
c148 <- Read10X_h5("GSM4475051_C148_filtered_feature_bc_matrix.h5")
c149 <- Read10X_h5("GSM4475052_C149_filtered_feature_bc_matrix.h5")
c152 <- Read10X_h5("GSM4475053_C152_filtered_feature_bc_matrix.h5")

These commands help us read the .h5 files, which are not readable by any default Windows application. These files are the foundation of our count matrix, the raw data that shows gene expression levels for each cell.

Cell Alignment

There is one problem with this data: many cells do not have complementary metadata (in this case, cell metadata refers to cell-specific attributes, including the cell type and which person/group the cell came from). With no or incomplete metadata, it is much harder to generate meaningful analyses later onward.

To resolve this, cells without metadata must be removed.

Set your working directory to the “cell-alignment” folder in “scRNA-Seq-with-COVID-19-Data-master” (it contains 12 .csv files starting with “c100numbers.csv”). The following code removes the cells that do not have associated metadata in the individual c objects (c51, c52, etc.).

c51.keep <- read.csv("c51numbers.csv", header=FALSE)
c51.vect <- c()
for (i in c51.keep){c51.vect <- append(c51.vect, i)}
c51.rev <- c51[,-c51.vect] #c51

c52.keep <- read.csv("c52numbers.csv", header=FALSE)
c52.vect <- c()
for (i in c52.keep){c52.vect <- append(c52.vect, i)}
c52.rev <- c52[,-c52.vect] #c52

c100.keep <- read.csv("c100numbers.csv", header=FALSE)
c100.vect <- c()
for (i in c100.keep){c100.vect <- append(c100.vect, i)}
c100.rev <- c100[,-c100.vect] #c100

c141.keep <- read.csv("c141numbers.csv", header=FALSE)
c141.vect <- c()
for (i in c141.keep){c141.vect <- append(c141.vect, i)}
c141.rev <- c141[,-c141.vect] #c141

c142.keep <- read.csv("c142numbers.csv", header=FALSE)
c142.vect <- c()
for (i in c142.keep){c142.vect <- append(c142.vect, i)}
c142.rev <- c142[,-c142.vect] #c142

c144.keep <- read.csv("c144numbers.csv", header=FALSE)
c144.vect <- c()
for (i in c144.keep){c144.vect <- append(c144.vect, i)}
c144.rev <- c144[,c144.vect] #c144

c143.keep <- read.csv("c143numbers.csv", header=FALSE)
c143.vect <- c()
for (i in c143.keep){c143.vect <- append(c143.vect, i)}
c143.rev <- c143[,-c143.vect] #c143

c145.keep <- read.csv("c145numbers.csv", header=FALSE)
c145.vect <- c()
for (i in c145.keep){c145.vect <- append(c145.vect, i)}
c145.rev <- c145[,-c145.vect] #c145

c146.keep <- read.csv("c146numbers.csv", header=FALSE)
c146.vect <- c()
for (i in c146.keep){c146.vect <- append(c146.vect, i)}
c146.rev <- c146[,-c146.vect] #c146

c148.keep <- read.csv("c148numbers.csv", header=FALSE)
c148.vect <- c()
for (i in c148.keep){c148.vect <- append(c148.vect, i)}
c148.rev <- c148[,-c148.vect] #c148

c149.keep <- read.csv("c149numbers.csv", header=FALSE)
c149.vect <- c()
for (i in c149.keep){c149.vect <- append(c149.vect, i)}
c149.rev <- c149[,-c149.vect] #c149

c152.keep <- read.csv("c152numbers.csv", header=FALSE)
c152.vect <- c()
for (i in c152.keep){c152.vect <- append(c152.vect, i)}
c152.rev <- c152[,-c152.vect] #c152

Count Matrix Construction

Now that we have streamlined the number of cells in each sample, we need to merge the samples into a large count matrix with rows as genes and columns as individual cells.

To do this, all of our samples must have the same number of rows and columns. We have to remove the last row from the following samples to make this possible (this has no negative effect on our analyses):

c141.rev.2 <- c141.rev[-nrow(c141.rev),]
c142.rev.2 <- c142.rev[-nrow(c142.rev),]
c143.rev.2 <- c143.rev[-nrow(c143.rev),]
c144.rev.2 <- c144.rev[-nrow(c144.rev),]
c145.rev.2 <- c145.rev[-nrow(c145.rev),]
c146.rev.2 <- c146.rev[-nrow(c146.rev),]
c148.rev.2 <- c148.rev[-nrow(c148.rev),]
c149.rev.2 <- c149.rev[-nrow(c149.rev),]
c152.rev.2 <- c152.rev[-nrow(c152.rev),]

We will merge all of these files into a large count matrix, “dgcmatrix.combined”. Make sure to add in c144 before c143 due to the order of the metadata:

dgcmatrix.combined <- cbind(c51.rev,c52.rev,c100.rev)
dgcmatrix.combined <- cbind(dgcmatrix.combined, c141.rev.2, c142.rev.2)
dgcmatrix.combined <- cbind(dgcmatrix.combined, c144.rev.2, c143.rev.2)
dgcmatrix.combined <- cbind(dgcmatrix.combined, c145.rev.2, c146.rev.2)
dgcmatrix.combined <- cbind(dgcmatrix.combined, c148.rev.2, c149.rev.2)
dgcmatrix.combined <- cbind(dgcmatrix.combined, c152.rev.2)

\[\\[0in]\]

Step 2: Metadata to SingleCellExperiment

Now, we can create an intermediate Seurat object, which allows us to easily add our metadata along with our count matrix. A Seurat object is essentially one type of container for storing single-cell data.

Before running the following code, change your working directory back to “scRNA-Seq-with-COVID-19-Data-master”, the folder you downloaded from GitHub.

metadata.seu <- read.csv("cell.metadata.csv", row.names=1)
seu <- CreateSeuratObject(counts=dgcmatrix.combined, assay="RNA", meta.data=metadata.seu) 

We can view the metadata to confirm that it has loaded properly:

seu@meta.data
##                       orig.ident nCount_RNA nFeature_RNA sample sample_new
## AAACCTGAGACACTAA-1 SeuratProject       5228         1635    C51        HC1
## AAACCTGAGGAGTACC-1 SeuratProject       5072         1386    C51        HC1
## AAACCTGAGGATATAC-1 SeuratProject       5531         1729    C51        HC1
## AAACCTGAGGTCATCT-1 SeuratProject       5630         1639    C51        HC1
## AAACCTGCACGGATAG-1 SeuratProject       7358         1722    C51        HC1
##                    group disease hasnCoV cluster    celltype
## AAACCTGAGACACTAA-1    HC       N       N       3 Macrophages
## AAACCTGAGGAGTACC-1    HC       N       N       3 Macrophages
## AAACCTGAGGATATAC-1    HC       N       N       3 Macrophages
## AAACCTGAGGTCATCT-1    HC       N       N       3 Macrophages
## AAACCTGCACGGATAG-1    HC       N       N       5 Macrophages
##  [ reached 'max' / getOption("max.print") -- omitted 63098 rows ]

Seurat to SingleCellExperiment

Another type of container for storing single-cell data is the SingleCellExperiment object, which we will be using to further analyze and explore our data.

Fortunately, we can easily convert our Seurat object to a SingleCellExperiment object:

sce <- as.SingleCellExperiment(seu)

We can view sce:

sce
## class: SingleCellExperiment 
## dim: 33538 63103 
## metadata(0):
## assays(2): counts logcounts
## rownames(33538): MIR1302-2HG FAM138A ... AC213203.1 FAM231C
## rowData names(0):
## colnames(63103): AAACCTGAGACACTAA-1 AAACCTGAGGAGTACC-1 ...
##   TTTGTCAGTCTGCAAT-1 TTTGTCAGTGTTGGGA-1
## colData names(11): orig.ident nCount_RNA ... celltype ident
## reducedDimNames(0):
## altExpNames(0):

\[\\[0in]\] To save our RAM, we can remove the various objects that we are not using anymore from our memory:

rm(dgcmatrix.combined)
rm(seu)
rm(c100)
rm(c100.keep)
rm(c100.rev)
rm(c141)
rm(c141.keep)
rm(c141.rev)
rm(c141.rev.2)
rm(c142)
rm(c142.keep)
rm(c142.rev)
rm(c142.rev.2)
rm(c143)
rm(c143.keep)
rm(c143.rev)
rm(c143.rev.2)
rm(c144)
rm(c144.keep)
rm(c144.rev)
rm(c144.rev.2)
rm(c145)
rm(c145.keep)
rm(c145.rev)
rm(c145.rev.2)
rm(c146)
rm(c146.keep)
rm(c146.rev)
rm(c146.rev.2)
rm(c148)
rm(c148.keep)
rm(c148.rev)
rm(c148.rev.2)
rm(c149)
rm(c149.keep)
rm(c149.rev)
rm(c149.rev.2)
rm(c152)
rm(c152.keep)
rm(c152.rev)
rm(c152.rev.2)
rm(c51)
rm(c51.keep)
rm(c51.rev)
rm(c52)
rm(c52.keep)
rm(c52.rev)

\[\\[0in]\]

Step 3: Data Operations

We have created a SingleCellExperiment object, which is ready for further data processing.

Quality Control

First, we will run quality control (QC) on our data. This ensures that we remove low-quality cells in our sce object. If present, low-quality cells often distort the results we obtain from later steps in scRNA-Seq analysis. For instance, such cells can form distinct clusters that do not reflect actual patterns in gene expression.

A common type of low quality cells are those with few expressed genes (in other words, scRNA-Seq failed to capture gene expression diversity for those cells). Another common source of low-quality cells are those with high proportions of mitochondrial RNA.

We can start quality control by identifying mitochondrial genes:

location <- rowRanges(sce)
is.mito <- any(seqnames(location)=="MT")

We can now add QC statistics for each cell to sce:

df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
df

Based on these statistics, we can identify what cells we will discard. We are discarding outlier cells (generally calculated by being more than 3 median absolute deviations from the median) when it comes to library sizes, expressed genes, and mitochondrial RNA:

qc.lib <- isOutlier(df$sum, log=TRUE, type="lower")
qc.nexprs <- isOutlier(df$detected, log=TRUE, type="lower")
qc.mito <- isOutlier(df$subsets_Mito_percent, type="higher")

discard <- qc.lib | qc.nexprs | qc.mito

We can view a summary of these outliers:

DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs), MitoProp=sum(qc.mito), Total=sum(discard))
## DataFrame with 1 row and 4 columns
##     LibSize    NExprs  MitoProp     Total
##   <integer> <integer> <integer> <integer>
## 1         0         2         0         2

Now, these outliers can be removed from our sce object:

sce <- sce[,!discard]
sce
## class: SingleCellExperiment 
## dim: 33538 63101 
## metadata(0):
## assays(2): counts logcounts
## rownames(33538): MIR1302-2HG FAM138A ... AC213203.1 FAM231C
## rowData names(0):
## colnames(63101): AAACCTGAGACACTAA-1 AAACCTGAGGAGTACC-1 ...
##   TTTGTCAGTCTGCAAT-1 TTTGTCAGTGTTGGGA-1
## colData names(11): orig.ident nCount_RNA ... celltype ident
## reducedDimNames(0):
## altExpNames(0):

Normalization and Feature Selection

We can now log-transform our expression counts. Log-transformation typically reduces skew in data, helping it conform better to a normal distribution:

sce <- logNormCounts(sce)

Once we have normalized our data, we can move on to feature selection (gene selection). This is typically done by selecting highly variable genes. Selecting the most variable genes can help us better explore cell expression differences (as supposed to selecting all genes, some of which are cell-specific and tell us little about the whole dataset).

Before we select our highly variable genes (HVGs), we need to model gene variation:

dec <- modelGeneVar(sce)

We can now select our top HVGs. I selected a numeric value (instead of a proportional value) because it allowed me to directly control the number of genes. This was because I wanted to make sure computation down the line was not too intensive.

hvg <- getTopHVGs(dec, n=1000)

\[\\[0in]\]

Step 4: Dimensionality Reduction

Suppose our data only had two genes, “Gene A” and “Gene B” (but the same amount of cells). We could create a representation where each axis represents the expression of one gene.

Consequently, we could plot the 60000 plus cells in our data in those two dimensions based on the expression levels of “Gene A” and “Gene B”.

Now, scale this representation up to the number of genes we have in our data: 33538. This would mean that our representation contains 33538 dimensions.

Unfortunately, we cannot interpret a plot with 33538 dimensions. To remedy this, we must reduce the number of dimensions. This is done in many ways, including compression of multiple genes into a single dimension. The major risk when reducing dimensions is loss of biological insights (in exchange for a more interpretable plot).

We will explore 3 methods of dimensionality reduction in this project: PCA, T-SNE, and UMAP.

Principal Component Analysis (PCA)

One method of dimensionality reduction is Principal Component Analysis, or PCA.

From the high dimensional space, PCA essentially finds out the axes that capture the largest amount of variation among cells. These axes are called principal components (PCs). Practically, the top PCs capture a large amount of expression differences (variance) so there are diminishing returns once a certain amount of PCs are selected.

To start PCA analysis, we must set a seed. This is important because it allows us to reproduce our results, even with many algorithms that are based on randomization.

set.seed(1234)

Now that we have set the seed, we can run PCA on our sce object:

sce <- runPCA(sce, subset_row=hvg)

We can now perform some diagnostics on our PCA results, informing us on how to adjust the number of PCs:

percent.var <- attr(reducedDim(sce),"percentVar")
chosen.elbow <- findElbowPoint(percent.var)
chosen.elbow
## [1] 5

Our previous code identifies the elbow in the curve of a scree plot (with percent of variance explained in the y-axis and number of PCs in the x-axis). The elbow point measures the number of PCs where a significant portion of the gene variation is explained. Practically, adding more PCs past the elbow point leads to diminishing returns.

We can view the elbow point (5 PCs) graphically below:

plot(percent.var, xlab="PC", ylab="Variance explained (%)")
abline(v=chosen.elbow, col="red")

We can accordingly subset our PCA results:

reducedDim(sce, "PCA") <- reducedDim(sce, "PCA")[,1:5]

Now, we can plot the results of our PCA results, colored by cell type:

plotPCA(sce, colour_by = "celltype")

This graph focuses on all of the cells in the sce object. It is apparent that similar cell types have similar expression profiles: this results in distinct groupings of green, blue, and orange dots, just to name a few colors.

To obtain targeted graphs, we can plot subsets of our data:

sce.severe <- sce[, sce$group == "S"] #subset indicating cells from patients with severe COVID-19 infection
sce.moderate <- sce[, sce$group == "M"] #subset indicating cells from patients with moderate COVID-19 infection
sce.covid <- sce[, sce$disease == "Y"] #subset indicating cells from patients with COVID-19
sce.control <- sce[, sce$disease == "N"] #subset indicating cells from patients without COVID-19

We can plot two graphs to compare PCA plots between those affected by COVID-19 and a control group:

plotPCA(sce.covid, colour_by = "celltype")

plotPCA(sce.control, colour_by = "celltype")

We can see two different groups of macrophages (in green) that barely overlap between control and COVID-19 groups. Also apparent is the much higher frequency of B, epithelial, and T cells in the COVID-19 plot.

Another comparative PCA plot we can graph is between those affected by severe and moderate COVID-19:

plotPCA(sce.severe, colour_by = "celltype")

plotPCA(sce.moderate, colour_by = "celltype")

The difference between these two plots is less obvious than the last two (COVID vs control). Interestingly, we can see significantly more neutrophil cells (in brown and on the left) expressed in the severe PCA plot. The same is generally true for plasma cells, although this trend is not very conspicuous with PCA plots. However, many cell types do not form distinct clusters when comparing these two groups. These overlapping groups include macrophages, epithelial cells, and T cells.

T-Distributed Stochaistic Neighbor Embedding (T-SNE)

PCA is one way to reduce the high-dimensional data to something we can interpret a little better.

Another dimensionality reduction method is T-SNE, or T-Distributed Stochaistic Neighbor Embedding.

The aim of T-SNE is finding a low-dimensional data representation that preserves the distances between each point in the higher dimension. Practically, this results in the formation of distinct clusters (of cells that are more similar to one in terms of expression profiles).

We can run TSNE on our sce object. Setting the seed is important here, too:

set.seed(00101001101)
sce <- runTSNE(sce, dimred="PCA")

You might have noticed that t-SNE analysis took significantly longer than PCA. This is because it is computationally more intense.

To continue, we can subset our sce object in the same way as we did in the PCA section:

sce.severe <- sce[, sce$group == "S"] #subset indicating cells from patients with severe COVID-19 infection
sce.moderate <- sce[, sce$group == "M"] #subset indicating cells from patients with moderate COVID-19 infection
sce.covid <- sce[, sce$disease == "Y"] #subset indicating cells from patients with COVID-19
sce.control <- sce[, sce$disease == "N"] #subset indicating cells from patients without COVID-19

First we will plot the whole dataset, like in the PCA section. This time, we will color it by celltype and disease (COVID-19 or control):

plotReducedDim(sce, dimred="TSNE", colour_by="celltype")

plotReducedDim(sce, dimred="TSNE", colour_by="disease")

Compared to the PCA plot, the TSNE plot has more defined groupings. For instance, in the first plot (colored by cell type), macrophages are split into three groups. According to the second plot (colored by disease), the right-most grouping (colored in blue) indicates cells, mainly macrophages from the control group. Based on this plot, there is a clear separation (in terms of gene expression profiles) between the macrophages of those who suffer from COVID-19 and the control group.

Along these lines, we can perform a second T-SNE plot using our subsetted sce objects. Specifically, we can compare the plots for severe and moderate COVID-19 patients (colored by cell type):

par(mfrow=c(1,2))
plotReducedDim(sce.severe, dimred="TSNE", colour_by="celltype")

plotReducedDim(sce.moderate, dimred="TSNE", colour_by="celltype")

Although there are a number of similarities between the moderate and severe T-SNE plot, there is one significant difference: macrophage grouping. A majority of the macrophages from the moderate T-SNE plot cluster together around the bottom right (with some around the center), while the overwhelming majority of the macrophages from the severe T-SNE plot center around the bottom to center left (forming a large, distinctly green figure). There is also almost a total absence of plasma cells in the moderate plot, while a small but dense grouping plasma cells cluster around the top of the graph for the severe plot.

Uniform Manifold Approximation and Projection (UMAP)

The final method of dimensionality reduction is UMAP, or Uniform Manifold Approximation and Projection.

In principle, UMAP is similar to T-SNE as it also tries to preserve the distances between cells from the high dimensional space in lower dimensions. Practically, UMAP takes less time than T-SNE, while resulting in more compact clusters than T-SNE.

Make sure to set the seed for UMAP analysis as it also involves algorithms that are based on randomization.

set.seed(1100101001)
sce <- runUMAP(sce, dimred='PCA')

Subsetting is necessary here, too:

sce.severe <- sce[, sce$group == "S"] #subset indicating cells from patients with severe COVID-19 infection
sce.moderate <- sce[, sce$group == "M"] #subset indicating cells from patients with moderate COVID-19 infection
sce.covid <- sce[, sce$disease == "Y"] #subset indicating cells from patients with COVID-19
sce.control <- sce[, sce$disease == "N"] #subset indicating cells from patients without COVID-19

Our first UMAP plot will represent the whole dataset, colored by cell type:

plotReducedDim(sce, dimred="UMAP", colour_by="celltype")

Our second UMAP plot will represent the whole dataset, colored by disease:

plotReducedDim(sce, dimred="UMAP", colour_by="disease")

Based on these two plots, the cells from the control group make a distinct cluster on the right (blue in the second plot). This cluster is comprised mainly of macrophages. Macrophages as a cell type are into 3 groupings in the top plot (echoing the T-SNE plot).

One could say that this UMAP plot echoes both T-SNE and PCA with even more distant, defined clusters.

Our third and fourth plots will use our subsetted data, comparing cells from moderate and severe COVID-19 patients:

par(mfrow=c(1,2))
plotReducedDim(sce.severe, dimred="UMAP", colour_by="celltype")

plotReducedDim(sce.moderate, dimred="UMAP", colour_by="celltype")

We can see a similar pattern in these plots when compared to the respective T-SNE (and PCA) plots: the macrophages of moderate COVID-19 patients generally clustered in a dense grouping in the top right of the plot, while the macrophages of the severe COVID-19 patients were distributed across the top left area of the plot. Plasma cells (yellow) are also seen significantly more in the severe plot.

\[\\[0in]\]

Step 5: Clustering

These groupings are called clusters, and the process of generating them is called clustering.

Clustering can be thought of as a way of viewing data that that is more or less similar to other data. In the case of this dataset, clustering could correspond to cell types or other factors that make cells similar. Similarity, however, can be adjusted while clustering.

We will be exploring two different types of clustering in this document:

Graph-Based Clustering

The first type of clustering we will explore in this document is graph-based clustering.

Graph-based clustering involves building a graph where each cell is connected to its closest neighbors in a higher-dimensional space. These connections, or edges, are weighted based on how similar cells are (more weight is given to similar cells). Then, computation is applied to identify various cell communities or clusters.

To start off, we can use the following code to construct the graph where cells are connected to a certain number of nearest neighbors. I chose 10 but this number can be adjusted).

g <- buildSNNGraph(sce, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership

We can view clust below:

table(clust)
## clust
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1615 2335 5083   70  616  775  433 4699 9236 5606  536 9660 1085 2468 2363  409 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
## 7265  931  769 2065  617  450  318  424 1460  249  211  277  563  160   41  137 
##   33   34   35   36   37   38 
##   24   20   23   52   23   33

We can assign the cluster assignments into our sce object as metadata, allowing us to visualize our newly-made clusters on our existing dimensionality reduction plots.

We will start with T-SNE:

colLabels(sce) <- factor(clust)
plotReducedDim(sce, "TSNE", colour_by="label")

We can do the same thing with a UMAP/PCA plot:

plotReducedDim(sce, "UMAP", colour_by="label")

It is even possible to visualize the nearest-neighbor graph:

set.seed(2000)
reducedDim(sce, "force") <- igraph::layout_with_fr(g)
plotReducedDim(sce, colour_by="label", dimred="force")

In all three of these plots, the graph-based clustering method we used determined that there were 38 clusters. This is not to say that there are 38 biologically defined groups; there aren’t.

Although the color scheme and sheer number of clusters make it difficult to distinguish between certain groups, cluster #1 appears in all areas except for the top right grouping (which, as we saw in T-SNE, was the control group). The majority of that top right grouping also is a color that is not seen in other parts of the graph.

K-Means Clustering

The second type of clustering we will explore in this document is k-means clustering.

K-means clustering allows the user to choose the number of clusters. The user can control this by choosing the number of cluster centers.

Suppose we pick 10 centers, or centroids. The K-means algorithm selects the location of these centroids as the initial cluster centers. It then keeps refining the location of the cluster centers to match the points better and better (minimizing the within-cluster sum of squares).

In other words, K-means clustering involves initial random selection of centroids but the aim is reduction of within-cluster variances.

We can start k-means analysis with 10 centers or clusters:

set.seed(100)
clust.kmeans <- kmeans(reducedDim(sce, "PCA"),centers=10)
table(clust.kmeans$cluster)
## 
##     1     2     3     4     5     6     7     8     9    10 
##  1491 10001 11260  8106  7403  3311  9626  1332  4030  6541

We can plot our T-SNE dimensional reduction using our newly generated clusters as the labels, in a similar fashion to graph-based clustering:

colLabels(sce) <- factor(clust.kmeans$cluster)
plotReducedDim(sce, "TSNE", colour_by="label")

An intentional method used by some people is setting k to a large enough value to avoid overclustering.

When comparing the K-means clustering map above to the T-SNE plots we did in Step 3, we can see that Cluster 9 roughly corresponds to macrophages from moderate COVID-19 patients. Cluster 6 approximately corresponds to epithelial cells and plasma cells from severe COVID-19 patients. Cluster 4 roughly outlines T-cells from both moderate and severe COVID-19 patients. Cluster 8 could correspond to neutrophils found in severe COVID-19 patients.

Clusters 3, 7, and 10 correspond to the large group of macrophages from severe COVID-19 patients. Clusters 2 and 5 roughly correspond to the macrophages from the control group.

Bonus: Combining K-means and Graph-Based Clustering

Here is a more intense clustering method involving two steps. It is essentially a combination of initial k-means clustering followed by graph-based clustering:

set.seed(0101010)
kgraph.clusters <- clusterSNNGraph(sce, use.dimred="PCA", use.kmeans=TRUE, kmeans.centers=1000, k=5)

We can plot a T-SNE dimensional reduction with the colors as clusters:

plotTSNE(sce, colour_by=I(kgraph.clusters))

When comparing this combined map to the T-SNE plots, we get a similar breakup of clusters as what is seen above with pure K-means clustering. For instance, Cluster 5 and 3 correspond to the macrophages from severe COVID-19 patients. Also, Clusters 2, 4, and 11 correspond to the macrophages from the control group.

The one major difference that is apparent is that the area indicated by Cluster 4 from the K-means map (which roughly corresponds to T cells), is split up into at least four clusters in this plot. All of these clusters seem to correspond to T-cells; in other words, no cluster represents a distinct cell type and/or condition (control, moderate, or severe).

\[\\[0in]\]

Takeaways

Based on our dimensionality reduction plots (PCA, T-SNE, UMAP), all three showed a significant difference (expressed by distance, in the graphs) in the expression profiles of macrophages when comparing COVID-19 patients and the control group. A significant difference was even apparent when comparing macrophages from moderate and severe COVID-19 patients.

In the abstract of the paper associated with this data, the authors stated two main conclusions:

Based on our analysis in Steps 3 and 4, we observed an abundance of macrophages associated with severe COVID-19 patients (supporting the paper’s first conclusion). In fact, these macrophages were often grouped together (mostly visible in T-SNE and UMAP plots).

A final note: the pattern identification we are doing in this project is not definitive analysis. Rather, it enables us to look at the same data in different ways, approaching it from a different perspective.

If you have any feedback on this project, please email .

\[\\[0in]\]

References

  1. Single Cell Gene Expression - 10x Genomics. [accessed 2020Aug.13]. https://www.10xgenomics.com/products/single-cell-gene-expression/

  2. Single-cell landscape of bronchoalveolar immune cells in patients . [accessed 2020Aug.13]. https://www.nature.com/articles/s41591-020-0901-9

  3. Orchestrating Single-Cell Analysis with Bioconductor. [accessed 2020Aug.13]. https://osca.bioconductor.org/

  4. scRNA-Seq. [accessed 2020Aug.13]. https://sapac.illumina.com/science/sequencing-method-explorer/kits-and-arrays/scrna-seq.html

  5. hemberg-lab/scRNA.seq.course. [accessed 2020Aug.13]. https://scrnaseq-course.cog.sanger.ac.uk/website/index.html

  6. Single-cell RNA sequencing technologies and bioinformatics pipelines. [accessed 2020Aug.13]. https://www.nature.com/articles/s12276-018-0071-8

  7. Understanding K-means Clustering in Machine Learning | by Dr . [accessed 2020Aug.13]. https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1